A/B Testing for Recommender System

The purpose of the research is to assess the results of an A/B test of the improved recommender system.

Objectives:

Technical specifications:

The research process:

  1. Data study.
  2. Evaluation of the correctness of the test.
  3. Exploratory data analysis.
  4. Evaluation of A/B testing results.
  5. General conclusion.

Data study

Load the datasets and save them in the corresponding variables.
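A minimal loading sketch. The file names and column names here are assumptions based on the dataset descriptions below, and the CSV is read from an in-memory string so the example is self-contained:

```python
import io
import pandas as pd

# Hypothetical stand-in for pd.read_csv('new_users.csv'); the columns
# are assumptions based on the dataset descriptions in this report.
raw = io.StringIO(
    "user_id,first_date,region,device\n"
    "u1,2020-12-07,EU,PC\n"
    "u2,2020-12-08,N.America,iPhone\n"
)
new_users = pd.read_csv(raw)
print(new_users.head())
```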

Let's look at the first rows of each dataset.

In the regions column of the 'marketing_events' dataset, multiple region names may be stored for a single event.

We will check the data types, as well as duplicates and missing values, using the 'info' method.

In all datasets, the date type is 'object', and it needs to be changed to 'datetime'. There are no duplicates. There are missing values in the details column of the 'events' dataset.
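A sketch of the conversion, assuming a date column named 'event_dt' (the real column names may differ):

```python
import pandas as pd

# Toy frame with the date stored as 'object' (plain strings).
events = pd.DataFrame({
    'user_id': ['u1', 'u2'],
    'event_dt': ['2020-12-07 10:00:00', '2020-12-08 12:30:00'],
})
# Convert to datetime so date arithmetic and filtering work correctly.
events['event_dt'] = pd.to_datetime(events['event_dt'])
print(events['event_dt'].dtype)
```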

Let's look at the missing values in the details column of the 'events' dataset. From the documentation, we know that the details column contains additional information about the event. For example, the cost of the purchase in dollars is stored in this field for the event 'purchase'.

The missing values occur in the events 'login', 'product_cart', and 'product_page'. This means that details are not collected for these events. We will leave the missing values as they are.

Conclusions

We have four datasets:

  1. marketing_events — a calendar of marketing events for the year 2020;
  2. new_users — users who registered from December 7th to December 21st, 2020;
  3. events — actions of new users from December 7th, 2020 to January 4th, 2021;
  4. participants — a table of test participants.

There are no duplicates, and there are missing values in the details column of the 'events' dataset, which we will leave unprocessed: details are not collected for all events.

The format of the columns with date information has been changed from 'object' to 'datetime'.

Evaluation of the correctness of the test

Compliance with the requirements of the technical specification

Dates of conduct according to the Technical Specification

New user recruitment was stopped two days later than the deadline set in the Technical Specification.

The test was stopped on December 30th, 5 days earlier than the deadline set in the Technical Specification. This means that not all users had time to complete the full 14-day testing period.

Audience: 15% of new users from EU region

We check whether test participants from the European Union make up 15 percent of all new users from this region who registered on the resource during the recruitment period from December 7th to December 21st.

To determine the region of test participants, we will add information from new_users to the test_participants dataset. We will only keep users from the European Union.
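A sketch of the merge and share calculation on toy data (the column names are assumptions):

```python
import pandas as pd

# Region lives in new_users, so join it onto the participants table.
new_users = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3', 'u4'],
    'region': ['EU', 'EU', 'EU', 'N.America'],
})
participants = pd.DataFrame({'user_id': ['u1', 'u2']})
participants = participants.merge(new_users, on='user_id', how='left')
eu_participants = participants[participants['region'] == 'EU']
# Share of EU test participants among all EU sign-ups.
eu_share = len(eu_participants) / (new_users['region'] == 'EU').sum()
print(f'{eu_share:.1%}')
```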

The test participants from the European Union made up 15%, which corresponds to the technical requirement.

Expected number of test participants: 6,000

Test timing

Check if there are any coincidences with marketing and other activities.

In the last 5 days of the test, the Christmas & New Year Promo was running in the European Union and North America. We will take this into account when assessing the test results.

Test audience

Check if there are any intersections with a competing test, as well as users participating in two test groups at the same time. Check the uniformity of distribution to the test groups and the correctness of their formation.

Users also participated in another test, interface_eu_test. Let's check for user overlap between the two tests and among the recommender_system_test groups.
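One way to find the overlap, assuming a participants table with 'user_id' and 'ab_test' columns (an assumption based on the two test names in the text):

```python
import pandas as pd

# Toy participants table: u2 is enrolled in both tests.
participants = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u2', 'u3'],
    'ab_test': ['recommender_system_test', 'recommender_system_test',
                'interface_eu_test', 'interface_eu_test'],
})
rec = set(participants.loc[
    participants['ab_test'] == 'recommender_system_test', 'user_id'])
other = set(participants.loc[
    participants['ab_test'] == 'interface_eu_test', 'user_id'])
overlap = rec & other  # users who appear in both tests
print(len(overlap))
```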

Found intersections between tests: 1,602 users participated in two tests.

There are no intersections among the test groups.

The test was stopped prematurely on December 30th, so not all users had the opportunity to perform events within 14 days. Let's see how this affects the test results.

There are more users in the control group.

Let's check the EU users' ratio again.

The share of EU users among test participants has decreased to 11.3%, while the Technical Specification requires 15%.

Let's prepare data for the research analysis.

The missing values in the date and event name columns indicate that there are users who have not taken any actions. We will replace the missing values in the event_name column with "visit" and in the details column with zero values.
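A sketch of the replacement on toy data (the column names are assumptions):

```python
import numpy as np
import pandas as pd

# Users without actions appear with NaN after joining events onto users.
merged = pd.DataFrame({
    'user_id': ['u1', 'u2'],
    'event_name': ['purchase', np.nan],
    'details': [9.99, np.nan],
})
merged['event_name'] = merged['event_name'].fillna('visit')
merged['details'] = merged['details'].fillna(0)
print(merged)
```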

We will also verify that the dates of the events match the dates of the test.

For each user, we also need to take into account only those events that occurred within the first 14 days after registration.
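The 14-day window can be applied like this (toy data; 'first_date' and 'event_dt' are assumed column names):

```python
import pandas as pd

# One user registered on Dec 7; only events within 14 days are kept.
users = pd.DataFrame({
    'user_id': ['u1'],
    'first_date': pd.to_datetime(['2020-12-07']),
})
events = pd.DataFrame({
    'user_id': ['u1', 'u1'],
    'event_dt': pd.to_datetime(['2020-12-10', '2020-12-25']),
})
events = events.merge(users, on='user_id')
lifetime = events['event_dt'] - events['first_date']
events_14d = events[lifetime <= pd.Timedelta(days=14)]
print(len(events_14d))
```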

Conclusions

The test deviated from the technical specification:

In group B, there were 1,820 users, and in the control group A, there were 2,425 users.

During the last 5 days of the test, a marketing event "Christmas & New Year Promo" took place in the European Union and North America, which may have affected the test results.

The data was prepared for exploratory data analysis.

Exploratory data analysis

Distribution of event count per user in samples

In both samples, the largest group of users had no events at all. In sample B, more than 800 users had no events, even though this sample was smaller than sample A.

The distribution of users who performed at least one event is close to normal, shifted to the left.

The average event count per user in group A is 5.1, in group B — 2.5.
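The per-user averages can be computed like this (toy data; the column names are assumptions):

```python
import pandas as pd

# Count events per user, then average the counts within each group.
events = pd.DataFrame({
    'user_id': ['a1', 'a1', 'a2', 'b1'],
    'group':   ['A',  'A',  'A',  'B'],
})
per_user = events.groupby(['group', 'user_id']).size()
avg_per_group = per_user.groupby('group').mean()
print(avg_per_group)
```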

Distribution of event count in samples by day

In the control group, the peak in the number of events occurs at the beginning of the second week, after which the number of events gradually decreases. In group B, there are two event peaks, at the beginning of the first week and at the beginning of the second week, followed by a similar decline.

Conversion in funnels at different stages

Let's look at the events that make up the funnel.

The dataset contains a 'visit' event, which means that the user entered the website but took no further action. At the first stage of the funnel, then, we will have users with the events 'visit' and 'login', i.e. all visitors.

At the registration step, group A shows a higher conversion rate from all visits: 68.1%, versus 32.5% in group B.

In group A, the product page view conversion rate was 64.9% of the registrations. In group B, 55.7%.

In group A, 47.9% of those who viewed the product page went to the shopping cart page, in group B 51.5%.

Next, we see an anomaly in group A: there were more users who paid than visited the shopping cart. Some users may have the technical ability to pay bypassing the shopping cart (for example, in the app).

Also, according to the technical specification, we expect that the number of shopping cart views and purchases will increase by no less than 10% in group B. Since the groups differ in the number of users, we will look at the relative numbers:
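A sketch of the relative comparison. The step counts here are illustrative, not the test's actual numbers:

```python
import pandas as pd

# Relative funnel: the share of each group's visitors reaching each step.
funnel = pd.DataFrame({
    'step': ['login', 'product_page', 'product_cart', 'purchase'],
    'A': [1000, 649, 311, 320],
    'B': [600, 334, 172, 165],
})
funnel['A_share'] = funnel['A'] / funnel.loc[0, 'A']
funnel['B_share'] = funnel['B'] / funnel.loc[0, 'B']
print(funnel)
```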

Conclusions

The distribution of users who performed at least one event is close to normal, shifted to the left.

In the control group, the peak in the number of events occurs at the beginning of the second week, then the number of events gradually decreases. In group B, there are two event peaks — at the beginning of the first week and the beginning of the second week, and then there is also a decrease.

At the registration step, group A shows a higher conversion rate from all visits: 68.1%, versus 32.5% in group B. In group A, the product page view conversion rate was 64.9% of registrations; in group B, it was 55.7%.

In group A, 47.9% of users who viewed the product page went on to the cart page; in group B, 51.5%.

There is an anomaly in group A: there were more users who paid than visited the cart. Some users may have the technical ability to pay without the cart (for example, in the app).

Also, according to the technical specifications, we expect that the number of cart views and purchases in group B will increase by at least 10%. Since the groups differ in the number of users, we will look at the relative numbers:

Evaluation of A/B testing results

To evaluate the results, we will create charts that display the dynamics of average check and conversion by groups.

Dynamics of average check in groups

For the charts, we will collect cumulative data. We will create dataframes 'cumulativeRevenueA' and 'cumulativeRevenueB' with columns:
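A sketch of building one such cumulative frame (toy purchases; the column names are assumptions following the naming used in the text):

```python
import pandas as pd

# Daily revenue and order counts, then a running total for the chart.
purchases = pd.DataFrame({
    'date': pd.to_datetime(['2020-12-07', '2020-12-07', '2020-12-08']),
    'revenue': [10.0, 20.0, 5.0],
})
daily = purchases.groupby('date').agg(revenue=('revenue', 'sum'),
                                      orders=('revenue', 'count'))
cumulativeRevenueA = daily.cumsum().reset_index()
# Cumulative average check = cumulative revenue / cumulative orders.
cumulativeRevenueA['avg_check'] = (cumulativeRevenueA['revenue']
                                   / cumulativeRevenueA['orders'])
print(cumulativeRevenueA)
```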

We build a chart of the dynamics of cumulative average check by groups.

From the chart, we can see that after two weeks of testing, the average check charts stabilized. The average check of group B is 4 dollars lower than the average check of group A.

Let's build a chart of the relative change in cumulative average check.

After the chart stabilized, the average check of group B remained more than 15% lower than in group A.

Dynamics of conversion in groups

We will create dataframes 'cumulativeConversionA' and 'cumulativeConversionB' with columns:

Let's build a chart of the dynamics of cumulative conversion by groups.

The conversion charts stabilized after 10 days of testing. Conversion in group B is 9.2%, in group A is 12.6%.

Let's build a chart of the relative change in cumulative conversion.

After the chart stabilized, the relative conversion rate of group B was more than 20% lower than that of group A. According to the technical specification, an improvement in the metric of no less than 10% was expected.

Let's check the statistical difference in proportions using a z-test.

According to the technical specifications, we expected an improvement in the conversion rate to view product cards of no less than 10%. We will compare the proportions of customers at all funnel steps.
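A sketch of a two-sided two-proportion z-test for one funnel step, with illustrative counts rather than the test's actual numbers:

```python
from math import sqrt

from scipy.stats import norm

# Illustrative counts: users reaching a funnel step in each group.
successes_a, n_a = 649, 1000
successes_b, n_b = 334, 600
p_a, p_b = successes_a / n_a, successes_b / n_b
# Pooled proportion under the null hypothesis of equal proportions.
p_pool = (successes_a + successes_b) / (n_a + n_b)
z = (p_a - p_b) / sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided test
print(f'z = {z:.2f}, p-value = {p_value:.4f}')
```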

Hypotheses

The null hypothesis of no difference between the proportions was rejected at the conversion stages from visits to registration and from registration to product page views; at the product page to cart and cart to purchase stages, the difference was not statistically significant.

Conclusions

After two weeks of testing, the cumulative average check graphs stabilized. The average check in group B remained more than 15% lower than in group A.

Conversion graphs stabilized after 10 days of testing. Conversion in group B was 9.2%, in group A 12.6%. The relative conversion of group B was more than 20% lower than group A. According to the technical specification, an improvement in the metric was expected by no less than 10%.

We tested the statistical difference of customer proportions at each stage of the conversion funnel using a z-test. The null hypothesis of no difference between the proportions was rejected for the conversions from visits to registration and from registration to product page views, but not for the product page to cart and cart to purchase stages.

General conclusion

The distribution of users who performed at least one event is close to normal, shifted to the left.

In the control group, the peak number of events occurs at the beginning of the second week, and then the number of events gradually decreases. In group B, there are two peaks of events — at the beginning of the first week and the beginning of the second week, and then a decrease.

According to the technical specifications, we expected the number of cart views and purchases in group B to increase by at least 10%. The test results did not show this:

After two weeks of testing, the cumulative average check graphs stabilized. The average check in group B remained more than 15% lower than in group A.

The conversion graphs stabilized after 10 days of testing. Conversion in group B is 9.2%, in group A it's 12.6%. The relative conversion of group B was lower than group A by more than 20%. As per the requirements, the improvement of the metric was expected to be no less than 10%.

The overlap of the last 5 days of the test with the Christmas & New Year Promo marketing event did not affect the test results.

We tested the statistical difference in customer proportions at each step of the conversion funnel using a z-test. The null hypothesis of no difference between the proportions was rejected at the conversion stages from visits to registration and from registration to product page views: the differences at these stages are significant.

The same hypothesis was not rejected at the product page to cart and cart to purchase stages: there is no basis to consider the proportions different at these stages.

As a result of the test, we can conclude that the new recommendation system decreases conversion. However, the test was carried out with significant deviations from the technical specifications:

Therefore, it is recommended to restart the test.